Table of Contents

  1. What is Data Frame?
  2. Creating Data Frame
  3. Exploring Data Frame
  4. Displaying Data Frame
  5. Summarizing Data Frame
  6. Selecting Data from Data Frame
  7. Returning Data Frame vs Vector
  8. More Fun with Data Frame!

1. What is Data Frame?

A data frame is a two-dimensional data structure in the R language, consisting of rows and columns. It is a special type of list in which every element is a column and all columns have the same length.

Figure: The anatomy of a data frame (picture source: https://www.geeksforgeeks.org/dataframe-operations-in-r/)

2. Creating Data Frame

Creating Data Frame from Vectors

A data frame can be created by passing vectors of equal length to the data.frame() function. Each vector is built with the c() function, which combines data items of the same type, and becomes one column. In the example below, a data frame is created from a character vector, a numeric vector, and a date vector.

members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'), # Character vector
  Height = c(170, 172, 180, 168), # Numeric vector
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04')) # Date vector
)

print(members)
##      Name Height   Birthday
## 1   Ricky    170 1990-01-01
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03
## 4 Jamaine    168 1994-04-04
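
The columns passed to data.frame() need compatible lengths. As a small sketch (the Team column and the tryCatch() wrapper are illustrative additions, not part of the original example), a single value is recycled to every row, while vectors with incompatible lengths raise an error:

```r
# A single value is recycled to fill every row
team <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Team = 'A' # Hypothetical column, recycled 4 times
)
nrow(team)
## [1] 4

# Vectors of incompatible lengths (4 and 3) cannot be combined
result <- tryCatch(
  data.frame(
    Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
    Height = c(170, 172, 180)
  ),
  error = function(e) conditionMessage(e) # Capture the error message as text
)
print(result)
```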

Creating Data Frame from Lists

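A data frame can be created by converting a list with the as.data.frame() function. Column names that are not syntactically valid in R, such as 'Column 1', are adjusted during conversion (here to Column.1).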

list_numbers <- list('Column 1' = 1:4, 'Column 2' = 5:8, 'Column 3' = 9:12)
df_numbers <- as.data.frame(list_numbers)
df_numbers
##   Column.1 Column.2 Column.3
## 1        1        5        9
## 2        2        6       10
## 3        3        7       11
## 4        4        8       12
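
The names in the output contain a dot instead of a space because names are converted to syntactically valid R names by default. A minimal sketch of one way to keep the original names, assuming it is acceptable to call data.frame() with check.names = FALSE in place of as.data.frame() (df_spaces is a name chosen for this example):

```r
list_numbers <- list('Column 1' = 1:4, 'Column 2' = 5:8, 'Column 3' = 9:12)

# check.names = FALSE keeps names that are not syntactically valid
df_spaces <- data.frame(list_numbers, check.names = FALSE)
names(df_spaces)
## [1] "Column 1" "Column 2" "Column 3"

# Such columns need backticks (or quotes) when selected with $
df_spaces$`Column 1`
## [1] 1 2 3 4
```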

Creating Data Frame from Matrix

A data frame can be created by converting a matrix with the as.data.frame() function.

matrix_numbers <- matrix(1:12, nrow = 4, ncol = 3)
df_numbers <- as.data.frame(matrix_numbers)
colnames(df_numbers) <- c('Column 1', 'Column 2', 'Column 3') # Assign column names
df_numbers
##   Column 1 Column 2 Column 3
## 1        1        5        9
## 2        2        6       10
## 3        3        7       11
## 4        4        8       12
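
Without assigned names, the columns of a converted matrix get the default names V1, V2 and so on. As a sketch of an alternative to calling colnames() afterwards, the names can be supplied through the dimnames argument of matrix() before conversion (the names A, B and C are placeholders):

```r
# Default column names after conversion
names(as.data.frame(matrix(1:12, nrow = 4, ncol = 3)))
## [1] "V1" "V2" "V3"

# Column names supplied up front via dimnames (row names left as NULL)
matrix_named <- matrix(
  1:12, nrow = 4, ncol = 3,
  dimnames = list(NULL, c('A', 'B', 'C'))
)
df_named <- as.data.frame(matrix_named)
names(df_named)
## [1] "A" "B" "C"
```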

Creating Data Frame from File

A data frame can be created by importing data from a local file or from the web. For example, a comma-separated values (CSV) file can be imported using the read.csv() function, which accepts either a file path or a URL.

# Import from a CSV file on local computer
audiobooks <- read.csv('audiobooks.csv')

# Import the same CSV file from the web
audiobooks <- read.csv('https://raw.githubusercontent.com/rickysoo/top_audiobooks/main/TopAudiobooks-20201107-122322.csv')

head(audiobooks) # Show first 6 rows
##   Rank            Title                                                Subtitle
## 1    1      Greenlights                                                    <NA>
## 2    2  The Ice Diaries  The Untold Story of the Cold War's Most Daring Mission
## 3    3      Tiny Habits                The Small Changes That Change Everything
## 4    4 A Time for Mercy                                   A Jack Brigance Novel
## 5    5     The Sentinel                                    A Jack Reacher Novel
## 6    6        Clanlands Whisky, Warfare, and a Scottish Adventure Like No Other
##                                                    Author
## 1                                     Matthew McConaughey
## 2    Captain William R. Anderson, Don Keith - contributor
## 3                                             BJ Fogg PhD
## 4                                            John Grisham
## 5                                 Lee Child, Andrew Child
## 6 Sam Heughan, Graham McTavish, Diana Gabaldon - foreword
##                                                  Narrator             Length
## 1                                     Matthew McConaughey  6 hrs and 42 mins
## 2                                           Roger Mueller   10 hrs and 1 min
## 3                                             BJ Fogg PhD 11 hrs and 22 mins
## 4                                            Michael Beck 19 hrs and 59 mins
## 5                                             Scott Brick 10 hrs and 39 mins
## 6 Graham McTavish, Sam Heughan, Diana Gabaldon - foreword 10 hrs and 22 mins
##    Release Language              Stars       Ratings  Price
## 1 10-20-20  English   5 out of 5 stars 7,172 ratings $28.00
## 2 10-15-19  English   5 out of 5 stars     6 ratings $30.79
## 3 01-14-20  English 4.5 out of 5 stars   742 ratings $34.95
## 4 10-13-20  English   5 out of 5 stars 3,534 ratings $31.50
## 5 10-27-20  English 4.5 out of 5 stars 1,394 ratings $31.50
## 6 11-03-20  English   5 out of 5 stars   158 ratings $21.81
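
In the output above, Release, Ratings and Price are stored as text rather than as dates or numbers, so they would typically need converting before any analysis. The sketch below uses a one-row inline CSV copied from the first record shown above (passed through the text argument of read.csv()) so that it runs without the file. The cleaning steps are one possible approach and are not part of the original tutorial; sample_books is a name chosen for this example.

```r
csv_text <- 'Title,Release,Ratings,Price\nGreenlights,10-20-20,"7,172 ratings",$28.00\n'
sample_books <- read.csv(text = csv_text)

# Parse month-day-year text into a Date
sample_books$Release <- as.Date(sample_books$Release, format = '%m-%d-%y')

# Keep only the digits of the ratings count
sample_books$Ratings <- as.integer(gsub('[^0-9]', '', sample_books$Ratings))

# Drop the dollar sign before converting the price to a number
sample_books$Price <- as.numeric(sub('$', '', sample_books$Price, fixed = TRUE))

str(sample_books)
```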

3. Exploring Data Frame

There are a number of functions to show the characteristics of a data frame.

# Data frame is a special case of list
typeof(members)
## [1] "list"
# The class is data.frame
class(members)
## [1] "data.frame"
# Check if "members" is a data frame
is.data.frame(members)
## [1] TRUE
# Number of columns
ncol(members)
## [1] 3
# Column names
names(members)
## [1] "Name"     "Height"   "Birthday"
# Number of rows
nrow(members)
## [1] 4
# Row names
row.names(members)
## [1] "1" "2" "3" "4"
# The dimension
dim(members)
## [1] 4 3
# The row and column names
dimnames(members)
## [[1]]
## [1] "1" "2" "3" "4"
## 
## [[2]]
## [1] "Name"     "Height"   "Birthday"
# Internal structure of the data frame
str(members)
## 'data.frame':    4 obs. of  3 variables:
##  $ Name    : chr  "Ricky" "Fatimah" "Kumanan" "Jamaine"
##  $ Height  : num  170 172 180 168
##  $ Birthday: Date, format: "1990-01-01" "1991-02-02" ...
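
Because a data frame is a list of equal-length columns, list tools work on it column by column. A small sketch using the members data frame from above (re-created here so the example runs on its own):

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

length(members) # Length of the list = number of columns
## [1] 3
sapply(members, class) # Class of each column
sapply(members, length) # Every column has the same length (the number of rows)
```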

4. Displaying Data Frame

A number of functions can be used to show the whole or part of the data frame in order to examine the data.

print(members) # Print the whole data frame
##      Name Height   Birthday
## 1   Ricky    170 1990-01-01
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03
## 4 Jamaine    168 1994-04-04
View(members) # View the data frame in data viewer in RStudio
head(audiobooks, n = 3) # Show the first 3 rows of the data frame. The default is 6 rows.
##   Rank           Title                                               Subtitle
## 1    1     Greenlights                                                   <NA>
## 2    2 The Ice Diaries The Untold Story of the Cold War's Most Daring Mission
## 3    3     Tiny Habits               The Small Changes That Change Everything
##                                                 Author            Narrator
## 1                                  Matthew McConaughey Matthew McConaughey
## 2 Captain William R. Anderson, Don Keith - contributor       Roger Mueller
## 3                                          BJ Fogg PhD         BJ Fogg PhD
##               Length  Release Language              Stars       Ratings  Price
## 1  6 hrs and 42 mins 10-20-20  English   5 out of 5 stars 7,172 ratings $28.00
## 2   10 hrs and 1 min 10-15-19  English   5 out of 5 stars     6 ratings $30.79
## 3 11 hrs and 22 mins 01-14-20  English 4.5 out of 5 stars   742 ratings $34.95
tail(audiobooks, n = 3) # Show the last 3 rows of the data frame. The default is 6 rows.
##     Rank               Title
## 98    98         If You Tell
## 99    99 Think and Grow Rich
## 100  100     The Housekeeper
##                                                                           Subtitle
## 98  A True Story of Murder, Family Secrets, and the Unbreakable Bond of Sisterhood
## 99                                                                            <NA>
## 100                                               A Twisted Psychological Thriller
##              Author         Narrator             Length  Release Language
## 98      Gregg Olsen     Karen Peakes 10 hrs and 34 mins 12-01-19  English
## 99    Napoleon Hill Erik Synnestvedt  9 hrs and 35 mins 10-16-07  English
## 100 Natalie Barelli    Susie Berneis   8 hrs and 4 mins 01-02-20  English
##                  Stars        Ratings  Price
## 98  4.5 out of 5 stars 11,827 ratings $29.99
## 99  4.5 out of 5 stars 21,869 ratings $24.95
## 100   4 out of 5 stars  6,503 ratings $34.99

The “dplyr” package provides some useful functions for working with data frames.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sample_n(audiobooks, size = 3) # Sample 3 rows randomly from the dataframe
##   Rank                       Title
## 1   83 A Kingdom of Flesh and Fire
## 2   80              Mount Fitz Roy
## 3   52 The Evening and the Morning
##                                        Subtitle                 Author
## 1 A Blood and Ash Novel (Blood and Ash, Book 2) Jennifer L. Armentrout
## 2                                          <NA>           Scott Sigler
## 3                           Kingsbridge, Book 4            Ken Follett
##        Narrator             Length  Release Language              Stars
## 1 Stina Nielsen 24 hrs and 21 mins 11-03-20  English   5 out of 5 stars
## 2    Ray Porter 29 hrs and 27 mins 12-03-20  English               <NA>
## 3      John Lee 24 hrs and 19 mins 09-15-20  English 4.5 out of 5 stars
##         Ratings  Price
## 1    69 ratings $30.09
## 2 Not rated yet $49.95
## 3 3,200 ratings $49.00
sample_frac(audiobooks, size = 0.05) # Sample 5% of the rows randomly from the dataframe
##   Rank                      Title                                  Subtitle
## 1    1                Greenlights                                      <NA>
## 2    3                Tiny Habits  The Small Changes That Change Everything
## 3   66 The Fellowship of the Ring Book One in The Lord of the Rings Trilogy
## 4    9            A Promised Land                                      <NA>
## 5   77           The Way of Kings            The Stormlight Archive, Book 1
##                Author                     Narrator             Length  Release
## 1 Matthew McConaughey          Matthew McConaughey  6 hrs and 42 mins 10-20-20
## 2         BJ Fogg PhD                  BJ Fogg PhD 11 hrs and 22 mins 01-14-20
## 3    J. R. R. Tolkien                   Rob Inglis  19 hrs and 7 mins 10-09-12
## 4        Barack Obama                 Barack Obama 29 hrs and 10 mins 11-17-20
## 5   Brandon Sanderson Kate Reading, Michael Kramer 45 hrs and 30 mins 08-31-10
##   Language              Stars        Ratings  Price
## 1  English   5 out of 5 stars  7,172 ratings $28.00
## 2  English 4.5 out of 5 stars    742 ratings $34.95
## 3  English   5 out of 5 stars 45,670 ratings $38.49
## 4  English               <NA>  Not rated yet $45.50
## 5  English   5 out of 5 stars 73,172 ratings $63.93
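
For comparison, a similar random sample can be drawn without dplyr by sampling row indexes with the base R sample() function. This is a sketch on the small members data frame; set.seed() is added only to make the draw repeatable, and the seed value is arbitrary. Recent dplyr versions also document slice_sample() as the successor to sample_n() and sample_frac().

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

set.seed(123) # Arbitrary seed so the same rows are drawn on each run
row_ids <- sample(nrow(members), size = 2) # 2 row indexes, without replacement
members[row_ids, ]
```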

5. Summarizing Data Frame

A summary of the data frame can be shown using the summary() function. For character variables, it shows the length, the class and the mode (the storage mode, not the statistical mode). For numeric and date variables, it shows the minimum, the quartiles, the median, the mean and the maximum.

summary(members)
##      Name               Height         Birthday         
##  Length:4           Min.   :168.0   Min.   :1990-01-01  
##  Class :character   1st Qu.:169.5   1st Qu.:1990-10-25  
##  Mode  :character   Median :171.0   Median :1992-02-17  
##                     Mean   :172.5   Mean   :1992-02-17  
##                     3rd Qu.:174.0   3rd Qu.:1993-06-10  
##                     Max.   :180.0   Max.   :1994-04-04
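
The individual statistics in the summary can also be computed one at a time. A brief sketch for the Height and Birthday columns, using the same members data as above:

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

mean(members$Height)
## [1] 172.5
median(members$Height)
## [1] 171
quartiles <- quantile(members$Height, probs = c(0.25, 0.75)) # 1st and 3rd quartiles
print(quartiles)
range(members$Birthday) # Earliest and latest birthday
```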

6. Selecting Data from Data Frame

Selecting Cell(s)

Select a single cell from a data frame by using the row and column indexes.

data <- members[1, 1] # Row 1, column 1
print(data)
## [1] "Ricky"

Select multiple cells from a data frame by using the row and column indexes.

data <- members[1:2, 1:2] # Rows 1 and 2, columns 1 and 2
print(data)
##      Name Height
## 1   Ricky    170
## 2 Fatimah    172

Column names can be used in selecting the columns.

data <- members[1:2, c('Name', 'Height')] # Rows 1 and 2, columns "Name" and "Height"
print(data)
##      Name Height
## 1   Ricky    170
## 2 Fatimah    172

Selecting Row(s)

Select a single row from a data frame by using the row index.

data <- members[1, ] # Row 1
print(data)
##    Name Height   Birthday
## 1 Ricky    170 1990-01-01

Select multiple rows from a data frame by using the row numbers.

data <- members[1:2, ] # Rows 1 and 2
print(data)
##      Name Height   Birthday
## 1   Ricky    170 1990-01-01
## 2 Fatimah    172 1991-02-02
data <- members[c(1, 3), ] # Rows 1 and 3
print(data)
##      Name Height   Birthday
## 1   Ricky    170 1990-01-01
## 3 Kumanan    180 1993-03-03

Selecting Column(s)

Select a single column from a data frame by using the column index.

data <- members[ , 1] # Column 1
print(data)
## [1] "Ricky"   "Fatimah" "Kumanan" "Jamaine"

Select multiple columns from a data frame by using the column indexes.

data <- members[ , 1:2] # Columns 1 and 2
print(data)
##      Name Height
## 1   Ricky    170
## 2 Fatimah    172
## 3 Kumanan    180
## 4 Jamaine    168
data <- members[, c(1, 3)] # Columns 1 and 3
print(data)
##      Name   Birthday
## 1   Ricky 1990-01-01
## 2 Fatimah 1991-02-02
## 3 Kumanan 1993-03-03
## 4 Jamaine 1994-04-04

Column names can be used in selecting the columns.

data <- members[ , c('Name', 'Birthday')] # Columns "Name" and "Birthday"
print(data)
##      Name   Birthday
## 1   Ricky 1990-01-01
## 2 Fatimah 1991-02-02
## 3 Kumanan 1993-03-03
## 4 Jamaine 1994-04-04

A column can be selected using the column name inside single brackets. The result is a data frame with one column.

data <- members['Name']
print(data)
##      Name
## 1   Ricky
## 2 Fatimah
## 3 Kumanan
## 4 Jamaine

Selecting Using Logical Vectors

Data can be selected using logical vectors.

data <- members[c(T, F, T, F), c(T, T, F)] # Select rows 1 and 3, and columns 1 and 2
print(data)
##      Name Height
## 1   Ricky    170
## 3 Kumanan    180

Selecting Using $ Operator

A column can be selected using the format dataframe$column.

data <- members$Name
print(data)
## [1] "Ricky"   "Fatimah" "Kumanan" "Jamaine"

Selecting Based on Condition

Data can be conditionally selected by placing a condition in the row position inside the brackets.

data <- members[members$Height > 170, ] # Show members with height more than 170cm
data
##      Name Height   Birthday
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03
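
The condition inside the brackets is itself a logical vector with one value per row, which is why this works in the same way as selecting with logical vectors. A sketch that prints the vector and then combines two conditions with & (the cut-off date is an arbitrary example):

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

tall <- members$Height > 170 # One TRUE/FALSE value per row
print(tall)
## [1] FALSE  TRUE  TRUE FALSE

# Taller than 170cm AND born before 1992
data <- members[tall & members$Birthday < as.Date('1992-01-01'), ]
print(data)
##      Name Height   Birthday
## 2 Fatimah    172 1991-02-02
```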

Data can be conditionally selected by using the subset() function. Inside subset(), columns can be referred to by name without the members$ prefix.

data <- subset(members, Height > 170) # Show members with height more than 170cm
data
##      Name Height   Birthday
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03
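
The subset() function can also limit the columns through its select argument. A minimal sketch that filters rows and keeps two columns in one call:

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

# Rows with height above 170cm, keeping only the Name and Height columns
data <- subset(members, Height > 170, select = c(Name, Height))
data
##      Name Height
## 2 Fatimah    172
## 3 Kumanan    180
```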

7. Returning Data Frame vs Vector

When single brackets [] are used with a single index or column name (no comma), the data is returned as a data frame.

When double brackets [[]] or the $ operator are used, the data is returned as a vector.

data <- members[1]
class(data) # Returns data frame
## [1] "data.frame"
data <- members[[1]]
class(data) # Returns vector
## [1] "character"
data <- members['Name']
class(data) # Returns data frame
## [1] "data.frame"
data <- members[['Name']]
class(data) # Returns vector
## [1] "character"
data <- members$Name
class(data) # Returns vector
## [1] "character"
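
One practical difference between the two vector-returning forms: $ takes the column name literally, so it cannot be used when the name is stored in a variable, whereas [[]] can. A short sketch (column_name is a hypothetical variable):

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

column_name <- 'Height'

members$column_name # Looks for a column literally called "column_name"
## NULL
members[[column_name]] # Uses the value stored in the variable
## [1] 170 172 180 168
```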

The “drop = FALSE” argument can be used to return a data frame instead of a vector.

data <- members[, 1] # By default, drop = TRUE
is.vector(data)
## [1] TRUE
is.data.frame(data)
## [1] FALSE
class(data) # Returns vector
## [1] "character"
data <- members[, 1, drop = FALSE] # Set drop = FALSE
is.vector(data)
## [1] FALSE
is.data.frame(data)
## [1] TRUE
class(data) # Returns data frame
## [1] "data.frame"
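
This matters when columns are chosen programmatically, because the result type then depends on how many columns happen to be selected. A sketch, with selected_columns as a hypothetical variable:

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

selected_columns <- c('Height') # Could hold one or several column names

data <- members[, selected_columns] # Collapses to a vector for a single column
nrow(data)
## NULL

data <- members[, selected_columns, drop = FALSE] # Stays a data frame
nrow(data)
## [1] 4
```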

8. More Fun with Data Frame!

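Data can be sorted using the order() function, which returns the row indexes that arrange a given column in ascending order. Those indexes are then used to select the rows in sorted order.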

height_order <- order(members$Height)
print(height_order)
## [1] 4 1 2 3
members[height_order, ]
##      Name Height   Birthday
## 4 Jamaine    168 1994-04-04
## 1   Ricky    170 1990-01-01
## 2 Fatimah    172 1991-02-02
## 3 Kumanan    180 1993-03-03
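
The order() function sorts in ascending order by default. As a sketch, passing decreasing = TRUE reverses the direction (height_desc is a name chosen for this example):

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

height_desc <- order(members$Height, decreasing = TRUE) # Tallest first
print(height_desc)
## [1] 3 2 1 4
members[height_desc, ]
##      Name Height   Birthday
## 3 Kumanan    180 1993-03-03
## 2 Fatimah    172 1991-02-02
## 1   Ricky    170 1990-01-01
## 4 Jamaine    168 1994-04-04
```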

Quickly visualize the data in a data frame using the plot() function! For a data frame with several columns, it draws a scatter plot for every pair of columns.

plot(members)

Don’t forget to save any updated data frame to a CSV file by using the write.csv() function.

write.csv(members, 'members.csv')
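
By default, write.csv() also writes the row names as an extra unnamed first column, which comes back as a column named X when the file is read again. A sketch that avoids this with row.names = FALSE, using a temporary path from tempfile() rather than 'members.csv' so that no existing file is overwritten (csv_path and members_read are names chosen for this example):

```r
members <- data.frame(
  Name = c('Ricky', 'Fatimah', 'Kumanan', 'Jamaine'),
  Height = c(170, 172, 180, 168),
  Birthday = as.Date(c('1990-01-01', '1991-02-02', '1993-03-03', '1994-04-04'))
)

csv_path <- tempfile(fileext = '.csv') # Temporary file for this example

write.csv(members, csv_path, row.names = FALSE) # Skip the row names column
members_read <- read.csv(csv_path)
names(members_read)
## [1] "Name"     "Height"   "Birthday"
```

Note that the Birthday column is read back as text, so it would need as.Date() again if dates are required.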

And finally, it’s….

The end!

Picture Source: https://www.pexels.com